SYS 6018 | Spring 2021 | University of Virginia


1 Introduction

Tell the reader what this project is about. Motivation.

2 Training Data / EDA

Load data, explore data, etc.

# Load Required Packages
library(tidyverse)
library(pROC)
library(randomForest)
library(GGally)
library(gridExtra)
library(plotly)
#library(reticulate)
library(regclass)
#library(ROSE)
library(MLeval)
library(ggplot2)
library(purrr)
library(broom)
url = 'HaitiPixels.csv'
#url = 'https://collab.its.virginia.edu/access/lessonbuilder/item/1707832/group/17f014a1-d43d-4c78-a5c6-698a9643404f/Module3/HaitiPixels.csv'
haiti <- read_csv(url)
print(dim(haiti))
#> [1] 63241     4
head(haiti)
#> # A tibble: 6 x 4
#>   Class        Red Green  Blue
#>   <chr>      <dbl> <dbl> <dbl>
#> 1 Vegetation    64    67    50
#> 2 Vegetation    64    67    50
#> 3 Vegetation    64    66    49
#> 4 Vegetation    75    82    53
#> 5 Vegetation    74    82    54
#> 6 Vegetation    72    76    52

The dataframe contains 4 columns and 63,241 rows. The Class column contains the correct label for each observation. The Red, Green, and Blue columns contain the pixel's intensity values (0–255) for each of the three color channels.

2.1 Class Factor

To prepare the data for exploratory data analysis, I convert Class to a factor, assigning the result back to haiti so the change persists:

haiti <- haiti %>% 
  mutate(Class = factor(Class))
haiti
#> # A tibble: 63,241 x 4
#>    Class        Red Green  Blue
#>    <fct>      <dbl> <dbl> <dbl>
#>  1 Vegetation    64    67    50
#>  2 Vegetation    64    67    50
#>  3 Vegetation    64    66    49
#>  4 Vegetation    75    82    53
#>  5 Vegetation    74    82    54
#>  6 Vegetation    72    76    52
#>  7 Vegetation    71    72    51
#>  8 Vegetation    69    70    49
#>  9 Vegetation    68    70    49
#> 10 Vegetation    67    70    50
#> # ... with 63,231 more rows

Examine the numbers and percentages in each of the 5 classes:

haiti %>%
  group_by(Class) %>%
  summarize(N = n()) %>%
  mutate(Perc = round(N / sum(N), 2) * 100)
#> # A tibble: 5 x 3
#>   Class                N  Perc
#> * <chr>            <int> <dbl>
#> 1 Blue Tarp         2022     3
#> 2 Rooftop           9903    16
#> 3 Soil             20566    33
#> 4 Various Non-Tarp  4744     8
#> 5 Vegetation       26006    41

2.1.0.1 Observations:

The records are not evenly distributed among the categories. Blue Tarp, our “positive” category if we frame this as a binary positive/negative identification, makes up only 3% of the sample. Soil and Vegetation together make up the majority of the sample at 74%.

2.2 Binary Class Factor vs. 5 Class Factor

It will be interesting to compare performance when predicting each of the five categories versus a binary is / is not Blue Tarp.

2.2.1 Create Binary DataFrame

Create a DataFrame that labels each pixel as Blue Tarp or not Blue Tarp:
  • 0 == Not a Blue Tarp
  • 1 == Is a Blue Tarp

After reviewing box plots for the 2-class data set, I also created two new calculated variables:
1. GBSqr = (Green + Blue)^2 * .001
2. RBSqr = (Red + Blue)^2 * .001

I created these to keep using the Red and Green values while increasing the difference in median value between the positive and negative classes. There is significant interplay among the Red, Green, and Blue values in identifying the correct shade of blue, so I wanted to retain Red and Green while increasing the linear separability between the classes. The 0.001 multiplier returns the numbers to a scale similar to standard RGB values.
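A quick sanity check of the transformation on a couple of illustrative pixels (synthetic RGB values chosen for illustration, not rows from HaitiPixels.csv) shows the intended effect: the squared sums stay on a familiar scale while the gap between a blue-tarp-like and a soil-like pixel widens:

```r
library(dplyr)

# Synthetic pixels: a blue-tarp-like pixel and a soil-like pixel
pixels <- tibble(
  Class = c("Blue Tarp", "Soil"),
  Red   = c(80, 180),
  Green = c(120, 160),
  Blue  = c(220, 130)
)

pixels %>%
  mutate(
    GBSqr = ((Green + Blue)^2) * .001,  # (120+220)^2 * .001 = 115.6 vs. (160+130)^2 * .001 = 84.1
    RBSqr = ((Red + Blue)^2) * .001     # (80+220)^2  * .001 = 90.0  vs. (180+130)^2 * .001 = 96.1
  )
```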

haitiBinary = haiti %>%
  mutate(ClassBinary = if_else(Class == 'Blue Tarp', '1', '0'),
         ClassBinary = factor(ClassBinary))

# I() is not needed inside mutate() and produces "AsIs" columns that
# can break some ggplot2/GGally stats, so the new columns are plain doubles
haitiBinarySqrs = haiti %>%
  mutate(GBSqr = ((Green + Blue)^2) * .001,
         RBSqr = ((Red + Blue)^2) * .001,
         ClassBinary = if_else(Class == 'Blue Tarp', '1', '0'),
         ClassBinary = factor(ClassBinary))

Examine the numbers and percentages in each of the 2 classes:

haitiBinary %>%
  group_by(ClassBinary) %>%
  summarize(N = n()) %>%
  mutate(Perc = round(N / sum(N), 2) * 100)
#> # A tibble: 2 x 3
#>   ClassBinary     N  Perc
#> * <fct>       <int> <dbl>
#> 1 0           61219    97
#> 2 1            2022     3

2.2.2 How are red, blue and green values distributed between the 5 classes?

redplot <- ggplot(haiti, aes(x=Class, y=Red)) + 
  geom_boxplot(col='red')

greenplot <- ggplot(haiti, aes(x=Class, y=Green)) + 
  geom_boxplot(col='darkgreen')

blueplot <- ggplot(haiti, aes(x=Class, y=Blue)) + 
  geom_boxplot(col='darkblue')

grid.arrange(redplot, greenplot, blueplot)

2.2.3 How are red, blue and green values distributed between the 2 classes?

redplotB <- ggplot(haitiBinary, aes(x=ClassBinary, y=Red)) + 
  geom_boxplot(col='red')

greenplotB <- ggplot(haitiBinary, aes(x=ClassBinary, y=Green)) + 
  geom_boxplot(col='darkgreen')

blueplotB <- ggplot(haitiBinary, aes(x=ClassBinary, y=Blue)) + 
  geom_boxplot(col='darkblue')

grid.arrange(redplotB, greenplotB, blueplotB)

How are red, blue and green values distributed between the 2 classes with the squared values for Red + Blue and Green + Blue?

redplotB <- ggplot(haitiBinarySqrs, aes(x=ClassBinary, y=RBSqr)) + 
  geom_boxplot(col='red')

greenplotB <- ggplot(haitiBinarySqrs, aes(x=ClassBinary, y=GBSqr)) + 
  geom_boxplot(col='darkgreen')

blueplotB <- ggplot(haitiBinarySqrs, aes(x=ClassBinary, y=Blue)) + 
  geom_boxplot(col='darkblue')

grid.arrange(redplotB, greenplotB, blueplotB)

2.2.3.1 Box Plot Observations

For the 5-class box plots:

We treat “Blue Tarp” as the “positive” result and all other classes as the “negative” result.

In the box plots of the five categories, it is notable that “Soil” and “Vegetation” are relatively distinct in their RGB distributions, while “Rooftop” and “Various Non-Tarp” are more similar in their RGB distributions.

For the 2-class box plots:

If the classes are collapsed to binary values of “Blue Tarp (1)” and “Not Blue Tarp (0)” there is little overlap in the blue values for the two classes, and the ranges of red and green are much smaller for blue tarp than non-blue-tarp.

Generally, the values of red have a larger range for negative results than for positive results, and the positive results have a similar median to the negative results.

Green values have a larger range for negative results than for positive results, and the positive results have a higher median than the negative results.

There is almost no overlap in the blue values between blue tarps and non-blue tarps.

For the 2-class box plots with the additive square values:

If the classes are collapsed to binary values of “Blue Tarp (1)” and “Not Blue Tarp (0)” there is little overlap in the blue values for the two classes, and the RBSqr and GBSqr values have much less overlap than without the additive square variables.

The values of RBSqr have a larger range for negative results than for positive results, and the median is significantly greater in the positive results.

GBSqr values have a larger range for negative results than for positive results. The positive results have a significantly higher median than the negative results.

Again, there is almost no overlap in the blue values between blue tarps and non-blue tarps.

2.2.4 View the correlation between Red, Green and Blue

These correlations make sense: the pixels are of highly saturated colors that are not pure Red, Green, or Blue, and there are few pixels in the data set with low values for all of R, G, and B.

#ggpairs(haiti, lower = list(continuous = "points", combo = "dot_no_facet"), progress = F)
ggpairs(haiti, progress = F)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#ggpairs(haiti, lower = list(continuous = "points", combo = "dot_no_facet"), progress = F)
ggpairs(haitiBinary[-1], progress = F)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggpairs(haitiBinarySqrs[-1], progress = F)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The RBSqr and GBSqr have significantly less variance in their values, and better differentiation between the 2 classes than the Red and Green variables. I will be using these transformed variables in my models.

2.2.5 3-D Scatterplot

To view the relationship between the Red, Green, and Blue values between the five classes, and the binary classes, an interactive 3-D scatter plot is illustrative.

2.2.5.1 Five-Class 3-D Scatterplot

fiveCat3D = plot_ly(x=haiti$Red, y=haiti$Blue, z=haiti$Green,
                    type="scatter3d", mode="markers", color=haiti$Class,
                    colors = c('blue2','azure4','chocolate4','coral2','chartreuse4'),
                    marker = list(symbol = 'circle', sizemode = 'diameter', opacity = 0.35))

fiveCat3D = fiveCat3D %>%
  layout(title="5 Category RGB Plot",
         scene = list(xaxis = list(title = "Red", color="red"),
                      yaxis = list(title = "Blue", color="blue"),
                      zaxis = list(title = "Green", color="green")))

fiveCat3D

5-Class 3-D Scatter Plot Observations
One can see discernible groupings of pixel categories by RGB values. Unsurprisingly, the blue tarps have higher blue values, but they span a range of red and green values.

The 3D scatter plot is particularly useful because, by zooming in, one can see that while the ‘Blue Tarp’ values are generally distinct, there is a space in the 3D plot with mingling of “blue tarp” pixels and other pixel categories. That area of the data will provide a challenge for our model.

2.2.5.2 Two-Class 3-D Scatterplot

binary3D = plot_ly(x=haitiBinarySqrs$RBSqr, y=haitiBinarySqrs$Blue, z=haitiBinarySqrs$GBSqr,
                   type="scatter3d", mode="markers", color=haitiBinarySqrs$ClassBinary,
                   colors = c('red','blue2'),
                   marker = list(symbol = 'circle', sizemode = 'diameter', opacity = 0.35))

binary3D = binary3D %>%
  layout(title="Binary RGB Plot",
         scene = list(xaxis = list(title = "RBSqr", color="red"),
                      yaxis = list(title = "Blue", color="blue"),
                      zaxis = list(title = "GBSqr", color="green")))

binary3D

2-Class 3-D Scatter Plot Observations With Blue, GBSqr, and RBSqr
Similar to the five category 3D scatter plot, the binary scatter plot shows distinct groupings for blue tarp and non-blue-tarp. There is a clear linear boundary between the blue tarp and non-blue tarp observations.

2.2.6 Parameter Selection:

Based on EDA, I am hopeful that my models will perform well using the following predictors:
  1. Red
  2. Green
  3. Blue
  4. GBSqr: ((Green + Blue)^2) * .001
  5. RBSqr: ((Red + Blue)^2) * .001

3 Model Training

3.1 Set-up

Normalization of the inputs does not need to be considered: Red, Green, and Blue share the same 0–255 range, and GBSqr and RBSqr were already rescaled by the 0.001 multiplier to a comparable range.

I am using the 2-Class data set for the following reasons:
  1. The distinctions in the 2-Class data set, as seen in the 3-D scatterplot, are clear.
  2. The stated problem is to classify ‘Blue Tarp’ versus the other classes; classifying the other classes is not of interest.

I am using 10-fold cross-validation to evaluate the models.



3.1.1 Training and Test Data:

I will hold out 20% of the data set for testing/validation.

library(caret)
library(boot)

3.2 80/20 Train/Test Split

# Use the full data set for cross-validation training; the 80/20 split
# below is retained (commented out) for the later hold-out evaluation.
train = haitiBinarySqrs

#set.seed(1976)
#sample_size = floor(0.8*nrow(haitiBinarySqrs))
## randomly split the data
#picked = sample(seq_len(nrow(haitiBinarySqrs)), size = sample_size)
#train = haitiBinarySqrs[picked,]
#test = haitiBinarySqrs[-picked,]
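If the 80/20 split is re-enabled, a stratified split is worth considering so both partitions keep roughly the same ~3% Blue Tarp share; caret's createDataPartition samples within each class level. A minimal sketch on a synthetic stand-in data frame (demo, demo_train, and demo_test are placeholder names, not objects from this report):

```r
library(caret)

set.seed(1976)
# Synthetic stand-in with the same ~97/3 class imbalance
demo <- data.frame(
  Blue        = runif(1000, 0, 255),
  ClassBinary = factor(sample(c("0", "1"), 1000, replace = TRUE,
                              prob = c(0.97, 0.03)))
)

# createDataPartition samples within each level of ClassBinary,
# so both partitions keep roughly the same positive share
idx        <- createDataPartition(demo$ClassBinary, p = 0.8, list = FALSE)
demo_train <- demo[idx, ]
demo_test  <- demo[-idx, ]

prop.table(table(demo_train$ClassBinary))
prop.table(table(demo_test$ClassBinary))
```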

3.3 Cross-Validation Performance

For logistic regression, LDA, QDA, and KNN, cross-validation used ROC as the tuning metric.

The following performance measures are collected for both the 10-fold cross-validation and the hold-out/testing/validation data:
  • AUROC
  • True Positive Rate
  • False Positive Rate
  • Precision



For the models:
  • No: Not a Blue Tarp (negative)
  • Yes: Is a Blue Tarp (positive)

3.4 Logistic Regression

Per our course’s Module 3 instruction, logistic regression is typically used when there are 2 classes. I will be using the two-class train dataframe (built from haitiBinarySqrs). Note that caret’s glmnet method fits a penalized (elastic-net) logistic regression, tuning alpha and lambda over the grid below.

Rename the factor levels: caret’s class-probability machinery requires syntactically valid level names, so “0”/“1” become “No”/“Yes” to enable the ROC curve functions.

levels(train$ClassBinary)
#> [1] "0" "1"
levels(train$ClassBinary)=c("No","Yes")

levels(train$ClassBinary)
#> [1] "No"  "Yes"
fct_count(train$ClassBinary)
#> # A tibble: 2 x 2
#>   f         n
#>   <fct> <int>
#> 1 No    61219
#> 2 Yes    2022
set.seed(1976)
# number: number of folds for cross-validation; repeats: repeated CV runs
trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 2,
                       summaryFunction = twoClassSummary, classProbs = TRUE,
                       savePredictions = TRUE)

log.cv.model = train(ClassBinary ~ Blue+Green+Red+GBSqr+RBSqr, data = train,
                     method = "glmnet", trControl = trctrl, tuneLength = 10)
#> Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
#> in the result set. ROC will be used instead.
log.cv.model
#> glmnet 
#> 
#> 63241 samples
#>     5 predictor
#>     2 classes: 'No', 'Yes' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (10 fold, repeated 2 times) 
#> Summary of sample sizes: 56916, 56917, 56917, 56917, 56917, 56917, ... 
#> Resampling results across tuning parameters:
#> 
#>   alpha  lambda        ROC        Sens       Spec     
#>   0.1    1.922525e-05  0.9994333  0.9992894  0.8550980
#>   0.1    4.441283e-05  0.9994213  0.9993139  0.8536141
#>   0.1    1.025994e-04  0.9991691  0.9996080  0.8323501
#>   0.1    2.370179e-04  0.9985715  0.9999020  0.7875933
#>   0.1    5.475421e-04  0.9973944  1.0000000  0.7030300
#>   0.1    1.264893e-03  0.9943187  1.0000000  0.5677669
#>   0.1    2.922068e-03  0.9864913  1.0000000  0.4448581
#>   0.1    6.750356e-03  0.9711792  1.0000000  0.1026179
#>   0.1    1.559420e-02  0.9355043  1.0000000  0.0000000
#>   0.1    3.602462e-02  0.8838932  1.0000000  0.0000000
#>   0.2    1.922525e-05  0.9995102  0.9988892  0.8706762
#>   0.2    4.441283e-05  0.9994448  0.9992649  0.8575721
#>   0.2    1.025994e-04  0.9992065  0.9995508  0.8348193
#>   0.2    2.370179e-04  0.9986487  0.9999020  0.7925377
#>   0.2    5.475421e-04  0.9975380  1.0000000  0.7141552
#>   0.2    1.264893e-03  0.9947265  1.0000000  0.5816161
#>   0.2    2.922068e-03  0.9872496  1.0000000  0.4500537
#>   0.2    6.750356e-03  0.9717887  1.0000000  0.1184510
#>   0.2    1.559420e-02  0.9337115  1.0000000  0.0000000
#>   0.2    3.602462e-02  0.8745199  1.0000000  0.0000000
#>   0.3    1.922525e-05  0.9995214  0.9988157  0.8741379
#>   0.3    4.441283e-05  0.9994656  0.9991588  0.8593023
#>   0.3    1.025994e-04  0.9992423  0.9994773  0.8372933
#>   0.3    2.370179e-04  0.9987313  0.9998857  0.7992160
#>   0.3    5.475421e-04  0.9976990  1.0000000  0.7228076
#>   0.3    1.264893e-03  0.9951103  1.0000000  0.5999122
#>   0.3    2.922068e-03  0.9880966  1.0000000  0.4569807
#>   0.3    6.750356e-03  0.9725693  1.0000000  0.1377457
#>   0.3    1.559420e-02  0.9337141  1.0000000  0.0000000
#>   0.3    3.602462e-02  0.8592215  1.0000000  0.0000000
#>   0.4    1.922525e-05  0.9995254  0.9987504  0.8775996
#>   0.4    4.441283e-05  0.9994775  0.9991424  0.8617714
#>   0.4    1.025994e-04  0.9992780  0.9994528  0.8422365
#>   0.4    2.370179e-04  0.9988200  0.9997876  0.8088572
#>   0.4    5.475421e-04  0.9978658  0.9999837  0.7364105
#>   0.4    1.264893e-03  0.9954638  1.0000000  0.6157367
#>   0.4    2.922068e-03  0.9888309  1.0000000  0.4698422
#>   0.4    6.750356e-03  0.9732470  1.0000000  0.1592560
#>   0.4    1.559420e-02  0.9349837  1.0000000  0.0000000
#>   0.4    3.602462e-02  0.8502785  1.0000000  0.0000000
#>   0.5    1.922525e-05  0.9995315  0.9986932  0.8800712
#>   0.5    4.441283e-05  0.9994884  0.9990362  0.8635041
#>   0.5    1.025994e-04  0.9993280  0.9993874  0.8456994
#>   0.5    2.370179e-04  0.9989185  0.9997305  0.8135578
#>   0.5    5.475421e-04  0.9980286  0.9999837  0.7514900
#>   0.5    1.264893e-03  0.9958251  1.0000000  0.6375006
#>   0.5    2.922068e-03  0.9894547  1.0000000  0.4881371
#>   0.5    6.750356e-03  0.9732489  1.0000000  0.1955982
#>   0.5    1.559420e-02  0.9328537  1.0000000  0.0000000
#>   0.5    3.602462e-02  0.8502785  1.0000000  0.0000000
#>   0.6    1.922525e-05  0.9995352  0.9986524  0.8840267
#>   0.6    4.441283e-05  0.9994988  0.9989627  0.8674609
#>   0.6    1.025994e-04  0.9993739  0.9993058  0.8503987
#>   0.6    2.370179e-04  0.9990196  0.9996406  0.8219639
#>   0.6    5.475421e-04  0.9982071  0.9999673  0.7655928
#>   0.6    1.264893e-03  0.9962020  1.0000000  0.6595120
#>   0.6    2.922068e-03  0.9901824  1.0000000  0.5014925
#>   0.6    6.750356e-03  0.9741437  1.0000000  0.2477759
#>   0.6    1.559420e-02  0.9306178  1.0000000  0.0000000
#>   0.6    3.602462e-02  0.8502785  1.0000000  0.0000000
#>   0.7    1.922525e-05  0.9995399  0.9986034  0.8889723
#>   0.7    4.441283e-05  0.9995118  0.9988811  0.8726552
#>   0.7    1.025994e-04  0.9994190  0.9992649  0.8543555
#>   0.7    2.370179e-04  0.9991234  0.9996161  0.8313625
#>   0.7    5.475421e-04  0.9983812  0.9999673  0.7833927
#>   0.7    1.264893e-03  0.9965943  1.0000000  0.6844876
#>   0.7    2.922068e-03  0.9909783  1.0000000  0.5232527
#>   0.7    6.750356e-03  0.9758676  1.0000000  0.3014400
#>   0.7    1.559420e-02  0.9282502  1.0000000  0.0000000
#>   0.7    3.602462e-02  0.8502785  1.0000000  0.0000000
#>   0.8    1.922525e-05  0.9995437  0.9985380  0.8931766
#>   0.8    4.441283e-05  0.9995260  0.9987422  0.8778471
#>   0.8    1.025994e-04  0.9994691  0.9991179  0.8617727
#>   0.8    2.370179e-04  0.9992252  0.9994528  0.8405075
#>   0.8    5.475421e-04  0.9985716  0.9998693  0.7979771
#>   0.8    1.264893e-03  0.9969599  0.9999918  0.7129213
#>   0.8    2.922068e-03  0.9921717  1.0000000  0.5462493
#>   0.8    6.750356e-03  0.9779321  1.0000000  0.3669743
#>   0.8    1.559420e-02  0.9252658  1.0000000  0.0000000
#>   0.8    3.602462e-02  0.8502785  1.0000000  0.0000000
#>   0.9    1.922525e-05  0.9995464  0.9984809  0.8993598
#>   0.9    4.441283e-05  0.9995373  0.9986279  0.8874884
#>   0.9    1.025994e-04  0.9995039  0.9989056  0.8706762
#>   0.9    2.370179e-04  0.9993457  0.9992404  0.8513852
#>   0.9    5.475421e-04  0.9988153  0.9996488  0.8189997
#>   0.9    1.264893e-03  0.9975457  0.9999837  0.7435790
#>   0.9    2.922068e-03  0.9935624  1.0000000  0.5850729
#>   0.9    6.750356e-03  0.9805630  1.0000000  0.4243379
#>   0.9    1.559420e-02  0.9217851  1.0000000  0.0000000
#>   0.9    3.602462e-02  0.8502785  1.0000000  0.0000000
#>   1.0    1.922525e-05  0.9995488  0.9984074  0.9025752
#>   1.0    4.441283e-05  0.9995466  0.9984727  0.8993598
#>   1.0    1.025994e-04  0.9995347  0.9986360  0.8869934
#>   1.0    2.370179e-04  0.9994747  0.9989872  0.8686973
#>   1.0    5.475421e-04  0.9990646  0.9992731  0.8412501
#>   1.0    1.264893e-03  0.9982243  0.9998857  0.7888309
#>   1.0    2.922068e-03  0.9951431  1.0000000  0.6357618
#>   1.0    6.750356e-03  0.9841582  1.0000000  0.4614276
#>   1.0    1.559420e-02  0.9167382  1.0000000  0.0000000
#>   1.0    3.602462e-02  0.8502785  1.0000000  0.0000000
#> 
#> ROC was used to select the optimal model using the largest value.
#> The final values used for the model were alpha = 1 and lambda = 1.922525e-05.

3.4.1 Logistic Regression Performance

10-fold cross-validation selected alpha = 1 and lambda = 1.922525e-05 for the model when ROC was used as the performance metric.

caret::confusionMatrix(log.cv.model)
#> Cross-Validated (10 fold, repeated 2 times) Confusion Matrix 
#> 
#> (entries are percentual average cell counts across resamples)
#>  
#>           Reference
#> Prediction   No  Yes
#>        No  96.6  0.3
#>        Yes  0.2  2.9
#>                             
#>  Accuracy (average) : 0.9953
Computed from the averaged confusion matrix, treating “No” as the positive class (caret’s default, since “No” is the first factor level):
  • TPR: 96.6 / 96.8 = 0.998
  • FPR: 0.3 / 3.2 = 0.094
  • Precision: 96.6 / 96.9 = 0.997
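The ratios above can be reproduced directly from the averaged cell percentages reported by caret::confusionMatrix (cell values copied from the output above):

```r
# Averaged confusion-matrix cells (percent of all observations;
# rows = prediction, columns = reference)
no_no   <- 96.6  # predicted No,  reference No
no_yes  <- 0.3   # predicted No,  reference Yes
yes_no  <- 0.2   # predicted Yes, reference No
yes_yes <- 2.9   # predicted Yes, reference Yes

# Ratios as listed above, with "No" (the first factor level,
# caret's default) taken as the positive class
tpr  <- no_no / (no_no + yes_no)     # 96.6 / 96.8
fpr  <- no_yes / (no_yes + yes_yes)  # 0.3  / 3.2
prec <- no_no / (no_no + no_yes)     # 96.6 / 96.9
round(c(TPR = tpr, FPR = fpr, Precision = prec), 3)  # 0.998 0.094 0.997
```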

3.4.2 Logistic Regression Cross-Validation ROC Curve:

result = evalm(log.cv.model)
#> ***MLeval: Machine Learning Model Evaluation***
#> Input: caret train function object
#> Averaging probs.
#> Group 1 type: repeatedcv
#> Observations: 63241
#> Number of groups: 1
#> Observations per group: 63241
#> Positive: Yes
#> Negative: No
#> Group: Group 1
#> Positive: 2022
#> Negative: 61219
#> ***Performance Metrics***

#> Group 1 Optimal Informedness = 0.986587402018881
#> Group 1 AUC-ROC = 1

result$roc

The Logistic Regression ROC-AUC for the 10-fold cross-validated training data is: 1.0.

3.5 LDA

Train the LDA model using 10-fold cross-validation. Tuning performed using ROC.

set.seed(1976)

lda.cv.model = train(ClassBinary ~ Blue+Green+Red+GBSqr+RBSqr, data = train, method = "lda", trControl=trctrl, tuneLength = 10)
#> Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
#> in the result set. ROC will be used instead.
lda.cv.model
#> Linear Discriminant Analysis 
#> 
#> 63241 samples
#>     5 predictor
#>     2 classes: 'No', 'Yes' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (10 fold, repeated 2 times) 
#> Summary of sample sizes: 56916, 56917, 56917, 56917, 56917, 56917, ... 
#> Resampling results:
#> 
#>   ROC       Sens       Spec     
#>   0.994494  0.9992241  0.8466895
caret::confusionMatrix(lda.cv.model)
#> Cross-Validated (10 fold, repeated 2 times) Confusion Matrix 
#> 
#> (entries are percentual average cell counts across resamples)
#>  
#>           Reference
#> Prediction   No  Yes
#>        No  96.7  0.5
#>        Yes  0.1  2.7
#>                             
#>  Accuracy (average) : 0.9943

3.5.1 LDA Training ROC Curve:

result.lda = evalm(lda.cv.model)
#> ***MLeval: Machine Learning Model Evaluation***
#> Input: caret train function object
#> Averaging probs.
#> Group 1 type: repeatedcv
#> Observations: 63241
#> Number of groups: 1
#> Observations per group: 63241
#> Positive: Yes
#> Negative: No
#> Group: Group 1
#> Positive: 2022
#> Negative: 61219
#> ***Performance Metrics***

#> Group 1 Optimal Informedness = 0.908116090617833
#> Group 1 AUC-ROC = 0.99

result.lda$roc

The LDA ROC-AUC for the 10-fold cross-validated training data is: 0.99.

3.6 QDA

Train the QDA model using 10-fold cross-validation. Tuning performed using ROC.

set.seed(1976)

qda.cv.model = train(ClassBinary ~ Blue+Green+Red+GBSqr+RBSqr, data = train, method = "qda", trControl=trctrl, tuneLength = 10)
#> Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
#> in the result set. ROC will be used instead.
qda.cv.model
#> Quadratic Discriminant Analysis 
#> 
#> 63241 samples
#>     5 predictor
#>     2 classes: 'No', 'Yes' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (10 fold, repeated 2 times) 
#> Summary of sample sizes: 56916, 56917, 56917, 56917, 56917, 56917, ... 
#> Resampling results:
#> 
#>   ROC        Sens       Spec     
#>   0.9973852  0.9977785  0.8949105
caret::confusionMatrix(qda.cv.model)
#> Cross-Validated (10 fold, repeated 2 times) Confusion Matrix 
#> 
#> (entries are percentual average cell counts across resamples)
#>  
#>           Reference
#> Prediction   No  Yes
#>        No  96.6  0.3
#>        Yes  0.2  2.9
#>                             
#>  Accuracy (average) : 0.9945

3.6.1 QDA Training ROC Curve:

result.qda = evalm(qda.cv.model)
#> ***MLeval: Machine Learning Model Evaluation***
#> Input: caret train function object
#> Averaging probs.
#> Group 1 type: repeatedcv
#> Observations: 63241
#> Number of groups: 1
#> Observations per group: 63241
#> Positive: Yes
#> Negative: No
#> Group: Group 1
#> Positive: 2022
#> Negative: 61219
#> ***Performance Metrics***

#> Group 1 Optimal Informedness = 0.945397391140487
#> Group 1 AUC-ROC = 1

result.qda$roc

The QDA ROC-AUC for the 10-fold cross-validated training data is: 1.0.

3.7 KNN

3.7.1 Tuning Parameter \(k\)

set.seed(1976)

knn.cv.model = train(ClassBinary ~ Blue+Green+Red+GBSqr+RBSqr, data = train,
                     method = "knn", trControl = trctrl,
                     tuneGrid = expand.grid(k = 1:21))
#> Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
#> in the result set. ROC will be used instead.
knn.cv.model
#> k-Nearest Neighbors 
#> 
#> 63241 samples
#>     5 predictor
#>     2 classes: 'No', 'Yes' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (10 fold, repeated 2 times) 
#> Summary of sample sizes: 56916, 56917, 56917, 56917, 56917, 56917, ... 
#> Resampling results across tuning parameters:
#> 
#>   k   ROC        Sens       Spec     
#>    1  0.9761098  0.9984809  0.9428852
#>    2  0.9898714  0.9983093  0.9453543
#>    3  0.9945459  0.9984972  0.9537555
#>    4  0.9968216  0.9984809  0.9540092
#>    5  0.9973851  0.9984400  0.9567332
#>    6  0.9980277  0.9983339  0.9601948
#>    7  0.9984119  0.9983175  0.9621738
#>    8  0.9986689  0.9983502  0.9589584
#>    9  0.9989239  0.9983094  0.9579757
#>   10  0.9991942  0.9983502  0.9557492
#>   11  0.9992105  0.9983829  0.9569819
#>   12  0.9994715  0.9983910  0.9564881
#>   13  0.9994774  0.9983829  0.9562381
#>   14  0.9994872  0.9983910  0.9554980
#>   15  0.9994948  0.9984482  0.9545091
#>   16  0.9994968  0.9983992  0.9552456
#>   17  0.9995010  0.9984155  0.9547554
#>   18  0.9995001  0.9983829  0.9537641
#>   19  0.9994956  0.9984155  0.9540104
#>   20  0.9994977  0.9983910  0.9537665
#>   21  0.9994940  0.9983747  0.9532690
#> 
#> ROC was used to select the optimal model using the largest value.
#> The final value used for the model was k = 17.
set.seed(1976)

knn.cv.model = train(ClassBinary ~ Blue+Green+Red+GBSqr+RBSqr, data = train,
                     method = "knn", trControl = trctrl,
                     tuneGrid = expand.grid(k = 17:35))
#> Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
#> in the result set. ROC will be used instead.
knn.cv.model
#> k-Nearest Neighbors 
#> 
#> 63241 samples
#>     5 predictor
#>     2 classes: 'No', 'Yes' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (10 fold, repeated 2 times) 
#> Summary of sample sizes: 56916, 56917, 56917, 56917, 56917, 56917, ... 
#> Resampling results across tuning parameters:
#> 
#>   k   ROC        Sens       Spec     
#>   17  0.9995010  0.9984155  0.9550029
#>   18  0.9995001  0.9983992  0.9535166
#>   19  0.9994956  0.9984155  0.9540104
#>   20  0.9994977  0.9984237  0.9527728
#>   21  0.9994940  0.9983747  0.9535166
#>   22  0.9994928  0.9983992  0.9532703
#>   23  0.9994911  0.9983339  0.9545054
#>   24  0.9994866  0.9983665  0.9542604
#>   25  0.9994843  0.9983093  0.9540116
#>   26  0.9996071  0.9983420  0.9537653
#>   27  0.9997275  0.9983502  0.9545067
#>   28  0.9997262  0.9983665  0.9542616
#>   29  0.9997247  0.9983420  0.9535178
#>   30  0.9997221  0.9983420  0.9542604
#>   31  0.9997199  0.9983420  0.9542604
#>   32  0.9997187  0.9983093  0.9530227
#>   33  0.9997147  0.9983012  0.9532727
#>   34  0.9997127  0.9983093  0.9532727
#>   35  0.9997103  0.9982685  0.9535202
#> 
#> ROC was used to select the optimal model using the largest value.
#> The final value used for the model was k = 27.
set.seed(1976)

knn.cv.model = train(ClassBinary ~ Blue+Green+Red+GBSqr+RBSqr, data = train,
                     method = "knn", trControl = trctrl,
                     tuneGrid = expand.grid(k = 27:51))
#> Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
#> in the result set. ROC will be used instead.
knn.cv.model
#> k-Nearest Neighbors 
#> 
#> 63241 samples
#>     5 predictor
#>     2 classes: 'No', 'Yes' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (10 fold, repeated 2 times) 
#> Summary of sample sizes: 56916, 56917, 56917, 56917, 56917, 56917, ... 
#> Resampling results across tuning parameters:
#> 
#>   k   ROC        Sens       Spec     
#>   27  0.9997275  0.9983502  0.9545067
#>   28  0.9997262  0.9983502  0.9537641
#>   29  0.9997247  0.9983420  0.9535178
#>   30  0.9997221  0.9983420  0.9535190
#>   31  0.9997199  0.9983420  0.9542604
#>   32  0.9997187  0.9983093  0.9525277
#>   33  0.9997147  0.9983012  0.9532727
#>   34  0.9997127  0.9983175  0.9525326
#>   35  0.9997103  0.9982685  0.9532727
#>   36  0.9997113  0.9982440  0.9532763
#>   37  0.9997088  0.9982685  0.9525326
#>   38  0.9997065  0.9982440  0.9508011
#>   39  0.9997040  0.9982930  0.9508011
#>   40  0.9997064  0.9983012  0.9495647
#>   41  0.9997078  0.9983094  0.9498098
#>   42  0.9997053  0.9983257  0.9490684
#>   43  0.9997041  0.9982930  0.9480808
#>   44  0.9997015  0.9982848  0.9478332
#>   45  0.9997018  0.9983012  0.9475857
#>   46  0.9997006  0.9982848  0.9463530
#>   47  0.9997002  0.9983094  0.9473382
#>   48  0.9996986  0.9982930  0.9446191
#>   49  0.9996975  0.9983012  0.9461018
#>   50  0.9996963  0.9982930  0.9448678
#>   51  0.9996945  0.9982930  0.9453617
#> 
#> ROC was used to select the optimal model using the largest value.
#> The final value used for the model was k = 27.
caret::confusionMatrix(knn.cv.model)
#> Cross-Validated (10 fold, repeated 2 times) Confusion Matrix 
#> 
#> (entries are percentual average cell counts across resamples)
#>  
#>           Reference
#> Prediction   No  Yes
#>        No  96.6  0.1
#>        Yes  0.2  3.1
#>                             
#>  Accuracy (average) : 0.9969

3.7.2 KNN k = 27 Training ROC Curve:

result.knn = evalm(knn.cv.model)
#> ***MLeval: Machine Learning Model Evaluation***
#> Input: caret train function object
#> Averaging probs.
#> Group 1 type: repeatedcv
#> Observations: 63241
#> Number of groups: 1
#> Observations per group: 63241
#> Positive: Yes
#> Negative: No
#> Group: Group 1
#> Positive: 2022
#> Negative: 61219
#> ***Performance Metrics***

#> Group 1 Optimal Informedness = 0.989088354922491
#> Group 1 AUC-ROC = 1

result.knn$roc

3.7.3 How were tuning parameter(s) selected? What value is used? Plots/Tables/etc.

I ran 10-fold cross-validation for several ranges of k:
  • 1 - 21: Returned best k == 17
  • 17 - 35: Returned best k == 27
  • 27 - 51: Returned best k == 27



Over the full range 1 - 51, the best k is 27. The tables of ROC, sensitivity, and specificity were reviewed for each cross-validation run. From these tables one can see that the improvements within the range are in the hundredths of a percentage point of ROC, so any k in the range of roughly 10 - 51 is a reasonable selection for the cross-validated training data.
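That flatness is easy to see by plotting the cross-validated ROC against k (values transcribed from the k = 1:21 run above, odd k only):

```r
# (k, ROC) pairs transcribed from the k = 1:21 cross-validation table
k_vals <- c(1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21)
roc_cv <- c(0.9761098, 0.9945459, 0.9973851, 0.9984119, 0.9989239,
            0.9992105, 0.9994774, 0.9994948, 0.9995010, 0.9994956,
            0.9994940)

# ROC climbs steeply for small k, then is essentially flat past k ~ 10
plot(k_vals, roc_cv, type = "b", xlab = "k", ylab = "CV ROC",
     main = "10-fold CV ROC vs. k")
k_vals[which.max(roc_cv)]  # 17 maximizes ROC within this grid
```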

3.8 Penalized Logistic Regression (ElasticNet)

3.8.1 Tuning Parameters

NOTE: PART II same as above plus add Random Forest and SVM to Model Training.

3.9 Threshold Selection
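One way to select a threshold, consistent with the ROC-based tuning above, is to take the probability cutoff that maximizes Youden's J (sensitivity + specificity - 1) using pROC's coords(). A minimal sketch on synthetic scores (labels and probs are placeholders, not predictions from the models above):

```r
library(pROC)

set.seed(1976)
# Synthetic class labels and predicted probabilities (~3% positives)
labels <- factor(c(rep("No", 970), rep("Yes", 30)), levels = c("No", "Yes"))
probs  <- c(rbeta(970, 1, 20),  # "No" scores cluster near 0
            rbeta(30, 20, 2))   # "Yes" scores cluster near 1

roc_obj <- roc(labels, probs, levels = c("No", "Yes"), direction = "<")

# Cutoff maximizing Youden's J = sensitivity + specificity - 1
coords(roc_obj, x = "best", best.method = "youden",
       ret = c("threshold", "sensitivity", "specificity"))
```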

4 Results (Cross-Validation)

Model              Tuning  AUROC  Threshold  Accuracy  TPR    FPR     Precision
Log Reg            N/A     1.0    1.0        0.9953    0.998  0.094   0.997
LDA                N/A     0.99   TBD        0.9943    0.999  0.156   0.995
QDA                N/A     1.0    TBD        0.9945    0.998  0.094   0.997
KNN                k = 27  1.0    TBD        0.997     0.998  0.0313  0.999
Penalized Log Reg  TBD     TBD    TBD        TBD       TBD    TBD     TBD
Random Forest      TBD     TBD    TBD        TBD       TBD    TBD     TBD
SVM                TBD     TBD    TBD        TBD       TBD    TBD     TBD

5 Conclusions

5.0.1 Conclusion #1

5.0.2 Conclusion #2

5.0.3 Conclusion #3

6 Hold-out Data / EDA

Load data, explore data, etc.

7 Results (Hold-Out)

8 Results (Test/Validate Data)

Model              Tuning   AUROC  Threshold  Accuracy  TPR  FPR  Precision
Log Reg            N/A      TBD    TBD        TBD       TBD  TBD  TBD
LDA                N/A      TBD    TBD        TBD       TBD  TBD  TBD
QDA                N/A      TBD    TBD        TBD       TBD  TBD  TBD
KNN                k = TBD  TBD    TBD        TBD       TBD  TBD  TBD
Penalized Log Reg  TBD      TBD    TBD        TBD       TBD  TBD  TBD
Random Forest      TBD      TBD    TBD        TBD       TBD  TBD  TBD
SVM                TBD      TBD    TBD        TBD       TBD  TBD  TBD

9 Final Conclusions

9.0.1 Conclusion #1

9.0.2 Conclusion #2

9.0.3 Conclusion #3

9.0.4 Conclusion #4

9.0.5 Conclusion #5

9.0.6 Conclusion #6